STOMP: confirm utf-8 handling (backport #13858) #13860

mergify · 2025-05-06T14:15:19Z

This is an intermediate conclusion/confirmation that our STOMP implementation can handle multi-byte characters in utf-8 encoding.

The question came up during the Native STOMP review.

The frame parser collects bytes one by one into an accumulator list and reverses that list before transitioning to the next state, so a multi-byte character is represented there by the corresponding number of integers, each no greater than 255. In tests and in our code we work with headers via Erlang string literals, which (at least with the default source file encoding) accept unicode just fine and use utf8 as the encoding. The tricky part is that a string literal is encoded as a list of integers (Unicode code points), not as a list of bytes:

"headꙕr1" becomes [104,101,97,100,42581,114,49].

Binary literals without encoding:

binary_to_list(<<"headꙕr1">>).
[104,101,97,100,85,114,49] %% complete nonsense from a string perspective: each code point is truncated to 8 bits

and with:

binary_to_list(<<"headꙕr1"/utf8>>).
[104,101,97,100,234,153,149,114,49]

This last list is exactly the list we get in the frame parser.
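The three representations above can be mirrored outside Erlang. The following is a minimal Python sketch, not the parser itself: `collect_bytes` is a hypothetical stand-in for the accumulate-then-reverse step, and `"\ua655"` is the escape for the ꙕ character used in the examples.

```python
# Python mirror of the three Erlang representations discussed above.
# "\ua655" is the character from the examples (code point 42581).
header = "head\ua655r1"

# Erlang string literal: a list of Unicode code points.
code_points = [ord(ch) for ch in header]

# <<"..."/utf8>>: the UTF-8 encoded bytes, i.e. what travels on the wire.
utf8_bytes = list(header.encode("utf-8"))

# <<"...">> without /utf8: each code point truncated to its low 8 bits.
truncated = [cp & 0xFF for cp in code_points]


def collect_bytes(data):
    """Hypothetical stand-in for the parser accumulator:
    prepend every byte, then reverse once at the end."""
    acc = []
    for byte in data:
        acc.insert(0, byte)  # comparable to [Byte | Acc] in Erlang
    acc.reverse()            # comparable to lists:reverse(Acc)
    return acc


parsed = collect_bytes(header.encode("utf-8"))

print(code_points)
print(utf8_bytes)
print(truncated)
print(parsed == utf8_bytes)
```

The list produced by the accumulator equals the `/utf8` byte list and differs from the code point list, which is why comparing parser output against a plain string literal fails in tests.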

It was confusing at the beginning, until I realized I was mostly fighting Erlang in the tests. The newly added python test simply confirms that utf-8 content is relayed just fine. As for the standard headers and our 'x-' extensions, they all fit into ASCII, so there is no problem when we look them up using stomp_frame:header.
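To illustrate why ASCII header look-ups are unaffected while utf-8 headers still pass through untouched, here is a small Python sketch that works on raw bytes only. It is not the test added in this PR; `build_frame` and `parse_headers` are hypothetical helpers, no broker is involved, and STOMP header escaping is deliberately left out.

```python
def build_frame(command, headers, body=b""):
    """Serialize a frame: command line, header lines, blank line, body, NUL."""
    lines = [command] + ["{}:{}".format(k, v) for k, v in headers.items()]
    return "\n".join(lines).encode("utf-8") + b"\n\n" + body + b"\x00"


def parse_headers(frame):
    """Split the header section on raw bytes, without decoding anything."""
    head, _, _ = frame.partition(b"\n\n")
    parsed = {}
    for line in head.split(b"\n")[1:]:  # skip the command line
        name, _, value = line.partition(b":")
        parsed[name] = value
    return parsed


frame = build_frame(
    "SEND",
    {"destination": "/queue/test", "head\ua655r1": "valu\ua655"},
    body=b"hello",
)
headers = parse_headers(frame)

# ASCII header names are byte-for-byte identical in utf-8,
# so a look-up with an ASCII key is unaffected.
destination = headers[b"destination"]

# A utf-8 header survives the round trip as the same bytes.
relayed = headers["head\ua655r1".encode("utf-8")]

print(destination)
print(relayed.decode("utf-8") == "valu\ua655")
```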

Bottom line:

We relay utf8 just fine. As long as we keep the default encoding for our source files, the string literals in our code keep working.

PS.

Curiously, Erlang's list_to_binary doesn't work with strings that contain code points above 255 (the unicode module must be used instead):

list_to_binary("headꙕr1").
** exception error: bad argument
     in function  list_to_binary/1
        called as list_to_binary([104,101,97,100,42581,114,49])
        *** argument 1: not an iolist term

I don't know yet whether this means anything for us outside STOMP, but for unicode-safe conversion list_to_binary should be replaced with unicode:characters_to_binary:

unicode:characters_to_binary("headꙕr1").
<<104,101,97,100,234,153,149,114,49>>

However, all our protocol strings fit into the first 128 ASCII codes, so it looks like we are just fine.
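The difference between the two conversions can be approximated in Python. This is only an analogy, under the assumption that `bytes(list)` is a fair stand-in for list_to_binary on a flat list (both accept only integers in 0..255) and that encoding to utf-8 is a fair stand-in for unicode:characters_to_binary. The helper names are hypothetical.

```python
def list_to_binary_like(code_points):
    """Rough analog of list_to_binary/1 on a flat list:
    every element must already be a byte (0..255)."""
    return bytes(code_points)


def characters_to_binary_like(code_points):
    """Rough analog of unicode:characters_to_binary/1:
    treat elements as code points and encode them as utf-8."""
    return "".join(chr(cp) for cp in code_points).encode("utf-8")


unicode_header = [104, 101, 97, 100, 42581, 114, 49]  # "headꙕr1"
ascii_header = [ord(ch) for ch in "destination"]

try:
    list_to_binary_like(unicode_header)
    rejected = False
except ValueError:
    # 42581 does not fit into a single byte.
    rejected = True

encoded = characters_to_binary_like(unicode_header)

# For pure ASCII input both conversions agree.
same_for_ascii = (
    list_to_binary_like(ascii_header) == characters_to_binary_like(ascii_header)
)

print(rejected)
print(list(encoded))
print(same_for_ascii)
```

Because both conversions agree on ASCII input, existing calls on protocol strings are not affected by the limitation.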

This is an automatic backport of pull request #13858 done by [Mergify](https://mergify.com).

(cherry picked from commit 0ec2599)

STOMP: confirm utf-8 handling (backport #13858) (cherry picked from commit 0aeca40)

STOMP: confirm utf-8 handling

0d284b0

(cherry picked from commit 0ec2599)

mergify bot assigned ikavgo May 6, 2025

michaelklishin added this to the 4.1.1 milestone May 6, 2025

michaelklishin merged commit 0aeca40 into v4.1.x May 6, 2025
271 checks passed

michaelklishin deleted the mergify/bp/v4.1.x/pr-13858 branch May 6, 2025 14:53

michaelklishin added a commit that referenced this pull request May 6, 2025

Merge pull request #13860 from rabbitmq/mergify/bp/v4.1.x/pr-13858

ebc0bbb
